Improving Pointwise Mutual Information (PMI) by Incorporating Significant Co-occurrence
Author
Abstract
We design a new co-occurrence-based word association measure by incorporating the concept of significant co-occurrence into the popular word association measure Pointwise Mutual Information (PMI). Through extensive experiments with a large number of publicly available datasets, we show that the newly introduced measure performs better than other co-occurrence-based measures and, despite being resource-light, compares well with the best-known resource-heavy distributional similarity and knowledge-based word association measures. We investigate the source of this performance improvement and find that, of the two types of significant co-occurrence (corpus-level and document-level), the concept of corpus-level significance combined with the use of document counts in place of word counts is responsible for all the performance gains observed. The concept of document-level significance is not helpful for adapting PMI.
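One ingredient the abstract highlights is estimating PMI from document counts rather than raw word counts. A minimal sketch of that idea, with purely illustrative counts (the function name and numbers are assumptions, not from the paper):

```python
import math

def pmi_from_doc_counts(df_x, df_y, df_xy, n_docs):
    """PMI(x, y) = log2( p(x, y) / (p(x) * p(y)) ),
    with probabilities estimated from document frequencies:
    p(x) = df_x / n_docs, p(x, y) = df_xy / n_docs."""
    p_x = df_x / n_docs
    p_y = df_y / n_docs
    p_xy = df_xy / n_docs
    return math.log2(p_xy / (p_x * p_y))

# Toy example: word x occurs in 100 documents, word y in 80,
# both together in 40, out of a 10,000-document corpus.
print(round(pmi_from_doc_counts(100, 80, 40, 10_000), 3))  # → 5.644
```

A positive value indicates the two words co-occur in documents more often than independence would predict; the paper's contribution is filtering such scores through a corpus-level significance test, which this sketch does not attempt.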
Similar resources
Second Order Co-occurrence PMI for Determining the Semantic Similarity of Words
This paper presents a new corpus-based method for calculating the semantic similarity of two target words. Our method, called Second Order Co-occurrence PMI (SOC-PMI), uses Pointwise Mutual Information to sort lists of important neighbor words of the two target words. Then we consider the words which are common in both lists and aggregate their PMI values (from the opposite list) to calculate t...
External Evaluation of Topic Models
Topic models can learn topics that are highly interpretable, semantically-coherent and can be used similarly to subject headings. But sometimes learned topics are lists of words that do not convey much useful information. We propose models that score the usefulness of topics, including a model that computes a score based on pointwise mutual information (PMI) of pairs of words in a topic. Our PM...
When the Whole Is Less Than the Sum of Its Parts: How Composition Affects PMI Values in Distributional Semantic Vectors
Distributional semantic models, deriving vector-based word representations from patterns of word usage in corpora, have many useful applications (Turney and Pantel 2010). Recently, there has been interest in compositional distributional models, which derive vectors for phrases from representations of their constituent words (Mitchell and Lapata 2010). Often, the values of distributional vectors...
Improving a Fundamental Measure of Lexical Association
Pointwise mutual information (PMI), a simple measure of lexical association, is part of several algorithms used as models of lexical semantic memory. Typically, it is used as a component of more complex distributional models rather than in isolation. We show that when two simple techniques are applied: (1) down-weighting co-occurrences involving low-frequency words in order to address PMI's so-ca...
Segmented Spoken Document Retrieval Using Word Co-occurrence Information
This paper describes several approaches for NTCIR-11 SpokenQuery&Doc [1]. It proposes several schemes that use word co-occurrence information for spoken document retrieval. Automatic transcriptions of spoken documents usually contain mis-recognized words, which significantly degrade retrieval performance. The cosine similarity used to measure document similarity must be i...